Abstract

The  problem  of  recognizing  objects  imaged  in  complex  real-world  scenes  is  examined  from 
a  parametric  perspective  using  the  theory  of  statistical  estimation.  A  scalar  measure  of  an 
object’s complexity, which is invariant under affine transformation and changes in image noise
level,  is  extracted  from  the  object’s  Fisher  information.  The  volume  of  Fisher  information  is 
shown  to  provide  an  overall  statistical  measure  of  the  object’s  recognizability  in  a  particular 
image,  while  the  complexity  provides  an  intrinsically  physical  measure  that  characterizes  the 
object  in  any  image.  An  information-conserving  method  is  then  developed  for  recognizing 
an  object  imaged  in  a  complex  scene.  Here  the  term  “information-conserving”  means  that 
the  method  uses  all  the  measured  data  pertinent  to  the  object’s  recognizability,  attains  the 
theoretical  lower  bound  on  estimation  error  for  any  unbiased  estimate  of  the  parameter  vector 
describing  the  object,  and  therefore  is  statistically  optimal.  This  method  is  then  successfully 
applied to finding objects imaged in thousands of complex real-world scenes.

1  Introduction


Charge-coupled device (CCD) cameras typically produce scene images with extremely low
but  nonzero  noise  variance.  In  fact,  for  object  recognition  purposes  in  computer  vision,  an 
initial  assumption  often  is  that  the  noise  can  be  neglected  so  that  the  data  at  each  pixel  can 
be  regarded  as  deterministic. 

In  the  present  investigation,  however,  we  take  an  alternative  approach  that  follows  a 
strictly  physical  interpretation  of  classical  estimation  theory.  First,  we  use  experimental  data 
to  determine  the  joint  probability  distribution  of  the  pixel  brightness  measurements  in  our 
CCD  images.  We  use  this  to  construct  the  likelihood  function  for  any  parameter  set  that  is 
to  be  estimated  given  our  image  data.  It  is  significant  that  the  form  of  the  likelihood  function 
in  this  physical  approach  is  not  arbitrary,  but  depends  upon  the  probability  distribution  of 
the  brightness  measurements  no  matter  how  low  the  corresponding  noise  variance  is  at  each 
pixel,  as  long  as  it  is  nonzero.  Moreover,  it  is  the  form  of  this  likelihood  function,  not  the 
level  of  the  noise,  that  determines  the  optimal  method  of  recognizing  an  imaged  object. 

To emphasize these issues, we show how a scalar measure of an object’s complexity, which
is  invariant  under  affine  transformation  and  changes  in  image  noise  level,  can  be  extracted 
from  the  determinant  of  the  object’s  Fisher  information  matrix.  The  volume  of  Fisher 
information  is  shown  to  provide  an  overall  statistical  measure  of  the  object’s  recognizability 
in  a  particular  image,  while  the  complexity  provides  an  intrinsically  physical  measure  that 
characterizes  the  object  in  any  image.  We  then  derive  a  method  of  recognizing  an  object 
in  a  complex  scene  that  attains  the  theoretical  lower  bound  on  mean-square  error  for  any 
unbiased  estimate  of  the  object’s  parameter  vector,  and  therefore  is  by  definition  statistically 
optimal  and  information-conserving.  From  a  computer  vision  perspective,  we  consider  the 
information-conserving  property  of  this  estimator  to  be  most  significant  because  it  assures 
that  the  method  uses  all  the  measured  data  pertinent  to  the  object’s  recognizability  regardless 
of  the  noise  level.  Many  popular  edge-based  methods,  for  example,  discard  a  significant 
amount  of  information  pertinent  to  an  object’s  recognizability  and  are  therefore  inherently 
sub-optimal. 

To illustrate our approach, we focus attention in the present paper on the problem of
recognizing objects that are uniquely determined by the six parameters of an affine transformation
as well as a seventh parameter that identifies the class of the object. Here, the affine
transformation describes rigid body motion and linear distortion of a model object, while
the class distinguishes it from other objects with the same affine parameters. For inherently
three-dimensional objects, the class must be supplemented by further parameterizations that
account for such effects as variation in shading caused by changes in surface orientation with
respect to a given source distribution and receiver geometry. For the recognition of flat
objects in real-world scenes, however, we show that such ancillary parameterizations are
unnecessary so long as the object does not have a purely specular surface. This is because the
optimal estimator for the affine parameters takes the form of a weighted filter that is
invariant to the uniform variations in shading characteristic of such flat objects. This weighting is
also necessary to discriminate against image ambiguities that are not explicitly accounted for
in classical estimation theory. It is significant that these image ambiguities make the
recognition problem inherently nonlinear. A global optimization procedure is therefore necessary
to compute the filter output and obtain the optimal estimate.

Our  method’s  performance  is  evaluated  experimentally  by  applying  it  to  the  problem  of 
recognizing  traffic  signs  in  images  of  complicated  outdoor  scenes.  In  both  our  theoretical  and 
experimental  analysis,  we  find  that  recognizability  is  strongly  dependent  upon  the  object’s 
complexity.  We  show  how  this  measure  becomes  analogous  to  the  complexity  traditionally 
referred  to  in  signal  processing  when  the  affine  transformation  is  reduced  to  a  1-D  shift  in 
the  position  of  a  1-D  object. 

2  The  statistics  of  image  brightness 

Charge-coupled device (CCD) cameras do not output the intensity $W$ of light. Instead,
they output a power-transformed intensity on an 8-bit grey-scale which we refer to as image
brightness $I(x, y)$. The brightness is linearly proportional to $W^{1/\gamma}(x, y)$ where $\gamma$ is a “gamma
correction,” e.g., $\gamma = 2.2$ [18]. The purpose of this transformation is to correct for the
response of cathode-ray tube monitors so that the output of any monitor is proportional to
intensity.

Experiments with the CCD video camera used in our vision system indicate that the
standard deviation $\sigma(x, y)$ of the output $I(x, y)$ is not only small compared to the mean
$m(x, y)$, but, as shown in Figure 1, does not depend on the mean or on position $(x, y)$. The
noise, therefore, is additive and signal-independent, such that $\sigma(x, y) = \sigma$. We speculate
that the noise is due to small mechanical vibrations between source and receiver, as well as
electronic shot noise. Thermally induced fluctuations of natural light, however, are not a
significant cause of errors in our measurements, as is shown in Appendix A.

Figure 1: The measured mean and standard deviation of the image brightness $I$ as a function
of the mean. The sample standard deviation is signal-independent and obtained by averaging
hundreds of images of outdoor scenes. The average standard deviation is 2.65.

Our measured average skew of -0.02 and kurtosis of 2.81 are so close to the corresponding
Gaussian  values  of  0  and  3,  respectively,  that  our  data  can  be  effectively  modeled  as  Gaussian 
at  each  pixel.  By  computation  of  the  sample  covariance  of  brightness  between  image  pixels, 
our  experiments  also  indicate  that  the  brightness  measurements  are  statistically  independent 
across  the  pixels. 
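
The Gaussianity check described above can be sketched numerically. The following is only an illustration under stated assumptions, not the measurement procedure used in our experiments: it assumes a hypothetical stack of repeated brightness images of a static scene (simulated here with the reported standard deviation of 2.65), and the function name `pixel_noise_statistics` and the array layout are introduced purely for this example.

```python
import numpy as np

def pixel_noise_statistics(stack):
    """Per-pixel mean, standard deviation, skew and kurtosis.

    stack: array of shape (T, M, N) holding T repeated brightness images
    of the same static scene (a hypothetical data layout).
    """
    stack = np.asarray(stack, dtype=float)
    mean = stack.mean(axis=0)
    centered = stack - mean
    std = centered.std(axis=0)        # population standard deviation
    z = centered / std                # standardized residuals
    skew = (z ** 3).mean(axis=0)      # equals 0 for a Gaussian
    kurt = (z ** 4).mean(axis=0)      # equals 3 for a Gaussian
    return mean, std, skew, kurt

# Simulated data: a random static scene plus additive, signal-independent noise.
rng = np.random.default_rng(0)
scene = rng.uniform(20.0, 230.0, size=(8, 8))
stack = scene + rng.normal(0.0, 2.65, size=(2000, 8, 8))

mean, std, skew, kurt = pixel_noise_statistics(stack)
print(std.mean(), skew.mean(), kurt.mean())
```

With real camera data, `stack` would be replaced by the measured image sequence; values of the averaged skew and kurtosis near 0 and 3 would support the Gaussian model.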

Let vector $\mathbf{I}$ represent image $I(x, y)$ where the rows of the image are concatenated into one
column vector in lexicographic order. Each component $I_k$ of vector $\mathbf{I}$ contains an independent
intensity measurement $I(x, y)$ for $1 \le k \le MN$. Then the probability density for $\mathbf{I}$ is

$$P(\mathbf{I}) = \frac{1}{(2\pi\sigma^2)^{MN/2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{k=1}^{MN}(I_k - m_k)^2\right). \tag{1}$$

3  Recognition as a parameter estimation problem


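We use the six-dimensional vector $\mathbf{a} = (x_0, y_0, \theta_0, s_x, s_y, \alpha)$ to describe rigid body motion
and linear distortion of an object $q$ in an image with position $\mathbf{x}_0 = (x_0, y_0)$, rotation $\theta_0$,
contractions $s_x$, $s_y$, and skew $\alpha$, which vanishes in a rectangular Cartesian coordinate system.
For example, suppose the general Cartesian coordinates $(x', y')$ are related to the rectangular
Cartesian system $(x, y)$ by the 2-D affine transformation

$$x' = s_x\,(x\cos\theta_0 + y\sin\theta_0) - x_0, \qquad y' = s_y\,(-x\sin(\theta_0+\alpha) + y\cos(\theta_0+\alpha)) - y_0, \tag{2}$$

which can be expressed more succinctly as $\mathbf{x}' = \mathbf{A}\mathbf{x} - \mathbf{x}_0$, where

$$\mathbf{A} = \begin{pmatrix} s_x & 0 \\ 0 & s_y \end{pmatrix} \begin{pmatrix} \cos\theta_0 & \sin\theta_0 \\ -\sin(\theta_0+\alpha) & \cos(\theta_0+\alpha) \end{pmatrix}. \tag{3}$$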

A model object $q(x', y')$ in some ideal reference frame $(x', y')$, therefore, appears as a translated,
rotated, contracted and skewed object $q(x, y; \mathbf{a})$ in the covariant reference frame $(x, y)$
of an image. The parameters $\mathbf{a}$ are then measured within the image reference frame such
that $-\infty < x_0, y_0 < \infty$, $0 \le \theta_0 < 2\pi$, $-\pi/2 < \alpha < \pi/2$, and $0 < s_x, s_y < \infty$, where dilations
occur for $0 < s_x, s_y < 1$ and contractions for $1 < s_x, s_y$.
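
The mapping from image coordinates to the model frame can be sketched in a few lines. This is a minimal illustration that assumes $\mathbf{A}$ is the product of the diagonal contraction matrix and the rotation-skew matrix of Eq. 3; the function names `affine_matrix` and `to_model_frame` are hypothetical.

```python
import numpy as np

def affine_matrix(theta0, sx, sy, alpha):
    """Matrix A, assumed to be diag(sx, sy) times the rotation-skew matrix."""
    contraction = np.array([[sx, 0.0],
                            [0.0, sy]])
    rotation_skew = np.array([
        [np.cos(theta0), np.sin(theta0)],
        [-np.sin(theta0 + alpha), np.cos(theta0 + alpha)],
    ])
    return contraction @ rotation_skew

def to_model_frame(xy, a):
    """Map image points (rows of xy) to model coordinates x' = A x - x0."""
    x0, y0, theta0, sx, sy, alpha = a
    A = affine_matrix(theta0, sx, sy, alpha)
    return np.asarray(xy, dtype=float) @ A.T - np.array([x0, y0])

# Example parameter vector a = (x0, y0, theta0, sx, sy, alpha).
a = (5.0, -3.0, np.pi / 6, 2.0, 0.5, 0.1)
A = affine_matrix(*a[2:])
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
model_points = to_model_frame(points, a)
print(A)
print(model_points)
```

Under this form, the determinant of $\mathbf{A}$ is $s_x s_y \cos\alpha$, so the transformation stays invertible over the stated skew range $-\pi/2 < \alpha < \pi/2$.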

To  account  for  the  possibility  that  distinct  objects  may  have  coincident  vectors  a  we 
define  an  additional  parameter  v  that  identifies  the  class  of  the  object.  For  example,  in 
traffic  sign  recognition,  a  “slow”  sign  is  in  a  different  class  from  a  “yield”  sign,  although  the 
two  may  have  the  same  a. 

From  the  perspective  of  statistical  estimation  theory,  recognizing  an  object  is  the  same 
as  estimating  the  parameters  a  and  v. 


4  Parameter  resolution:  Fisher  information,  recognizability,  and  the  coherence 
of  objects  in  images 

Let us consider the problem of recognizing an object of a given class in some scene. This can
equivalently be posed as the problem of estimating the parameter vector $\mathbf{a}$ given the image
data $\mathbf{I}$. In this case, the likelihood function for $\mathbf{a}$, given the image data $\mathbf{I}$, is

$$P(\mathbf{I}|\mathbf{a}) = \frac{1}{(2\pi\sigma^2)^{MN/2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{k=1}^{MN}(I_k - m_k(\mathbf{a}))^2\right), \tag{4}$$

where the mean $m_k(\mathbf{a})$ explicitly depends on the parameters to be estimated. The form of
this  likelihood  function,  given  our  CCD  data,  is  very  different  from  that  in  active  radar, 
laser,  and  sonar  imaging  where  nonlinear  speckle  noise  is  found  [13]. 

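The lower bound on the mean-square error in any unbiased estimate $\hat{\mathbf{a}}$ can be expressed
as

$$E[(\hat{\mathbf{a}} - \mathbf{a})(\hat{\mathbf{a}} - \mathbf{a})^T] \ge \mathbf{J}^{-1}, \tag{5}$$

where the Fisher information matrix $\mathbf{J}$ is defined by

$$J_{ij} = -E\left[\frac{\partial^2}{\partial a_i\,\partial a_j}\ln P(\mathbf{I}|\mathbf{a})\right] = \frac{1}{\sigma^2}\sum_{x}\sum_{y}\frac{\partial m(x,y;\mathbf{a})}{\partial a_i}\,\frac{\partial m(x,y;\mathbf{a})}{\partial a_j}. \tag{6}$$

Here the image mean $m(x, y; \mathbf{a})$ only depends on the parameter vector $\mathbf{a}$ for those pixels
$(x, y) \in O^+$ that constitute the expected object $q(x, y; \mathbf{a})$ and any neighboring pixels that
are affected by small changes in $\mathbf{a}$. The Fisher information matrix, therefore, can be reduced
to

$$J_{ij} = \frac{1}{\sigma^2}\sum_{(x,y)\in O^+}\frac{\partial q(x,y;\mathbf{a})}{\partial a_i}\,\frac{\partial q(x,y;\mathbf{a})}{\partial a_j}. \tag{7}$$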

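It is significant that any of the diagonal entries of the bound can be expressed as

$$E[(\hat{a}_i - a_i)^2] \ge [\mathbf{J}^{-1}]_{ii} = \frac{\sigma^2}{E}\,\ell_i^2, \tag{8}$$

where the object energy

$$E = \sum_{(x,y)\in O}|q(x,y;\mathbf{a})|^2 \tag{9}$$

and the coherence scale

$$\ell_i = \left(\frac{E}{\sigma^2}\,[\mathbf{J}^{-1}]_{ii}\right)^{1/2} \tag{10}$$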

for parameter $a_i$ are physical descriptors of the object which are invariant only under rigid
body motion. The coherence scale $\ell_i$ measures the sensitivity of the object to variations
in parameter $a_i$ and, therefore, can be interpreted as the width of the object’s autocorrelation
peak over lags in $a_i$. An object with relatively high sensitivity to parameter $a_i$, for
example, will have a relatively narrow autocorrelation peak. The error in estimating parameter
$a_i$, therefore, increases with the corresponding object coherence scale $\ell_i$ and additive
noise variance, but decreases with object energy.
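
To make these quantities concrete, the sketch below evaluates the Fisher information for position only, using central differences for the derivatives of a model image, and then reads off the coherence scales from the relation $[\mathbf{J}^{-1}]_{ii} = (\sigma^2/E)\,\ell_i^2$. It is an illustration under assumptions: the object is a synthetic Gaussian blob, the sums run over the whole image rather than over $O$ and $O^+$, and the function names are hypothetical.

```python
import numpy as np

def fisher_information_position(q, sigma):
    """2x2 Fisher information for (x0, y0) of a model image q.

    Derivatives are approximated with central differences. Shifting the
    object by x0 changes q by -dq/dx, but the sign cancels in the products.
    """
    q = np.asarray(q, dtype=float)
    dq_dy, dq_dx = np.gradient(q)      # axis 0 is y (rows), axis 1 is x
    grads = [dq_dx, dq_dy]
    J = np.array([[np.sum(gi * gj) for gj in grads] for gi in grads])
    return J / sigma ** 2

def coherence_scales(J, q, sigma):
    """Coherence scales l_i = sqrt((E / sigma^2) [J^-1]_ii)."""
    E = np.sum(np.asarray(q, dtype=float) ** 2)   # object energy
    return np.sqrt(E / sigma ** 2 * np.diag(np.linalg.inv(J)))

# Synthetic object: an isotropic Gaussian blob of width s on a 64x64 grid.
yy, xx = np.mgrid[0:64, 0:64]
s = 5.0
q = 200.0 * np.exp(-((xx - 32.0) ** 2 + (yy - 32.0) ** 2) / (2.0 * s ** 2))

sigma = 2.65
J = fisher_information_position(q, sigma)
ell = coherence_scales(J, q, sigma)
bound = np.diag(np.linalg.inv(J))      # lower bounds on position variance
print(ell, bound)
```

For a continuous Gaussian blob of width $s$ the coherence length works out to $s\sqrt{2}$, which the discrete estimate approaches; the resulting variance bound lies well below one pixel squared at this noise level.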

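When all parameters are uncoupled and $\mathbf{J}$ is diagonal, the product of $n_a$ coherence
scales $\ell_1 \cdots \ell_{n_a}$ yields a coherence volume that is a scalar measure characterizing the combined
$n_a$-dimensional variations of the object, where $n_a$ is the length of $\mathbf{a}$. More generally, we define
the coherence volume $V$ in terms of the determinant $|\mathbf{J}|$ of the Fisher information matrix by

$$V = \left(\frac{E}{\sigma^2}\right)^{n_a/2} |\mathbf{J}|^{-1/2}. \tag{11}$$

The lower bound can then be written as

$$\mathbf{J}^{-1} = \left(\frac{\sigma^2}{E}\right)^{n_a} \mathbf{J}^{\mathrm{adj}}\, V^2, \tag{12}$$

where $\mathbf{J}^{\mathrm{adj}}$ is the adjugate matrix of $\mathbf{J}$ [21].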
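
As a consistency check, the diagonal case can be worked through explicitly. The derivation below assumes that the coherence volume takes the form $V = (E/\sigma^2)^{n_a/2}\,|\mathbf{J}|^{-1/2}$, which is the form that agrees with the diagonal bound of Eq. 8 and with the coherence area of Eq. 21; it is a sketch under that assumption rather than a restatement of the definition.

```latex
% Assumed form of the coherence volume: V = (E/\sigma^2)^{n_a/2} |J|^{-1/2}
\begin{align*}
J_{ii} &= \frac{E}{\sigma^2\,\ell_i^2}
  && \text{(diagonal } \mathbf{J}\text{, from Eq. 8)} \\
|\mathbf{J}| &= \prod_{i=1}^{n_a} J_{ii}
  = \left(\frac{E}{\sigma^2}\right)^{n_a} \prod_{i=1}^{n_a} \ell_i^{-2} \\
V &= \left(\frac{E}{\sigma^2}\right)^{n_a/2} |\mathbf{J}|^{-1/2}
  = \prod_{i=1}^{n_a} \ell_i
\end{align*}
```

The volume therefore reduces to the product of the individual coherence scales when the parameters are uncoupled, and the noise variance cancels.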

These  coherence  scales  have  compelling  physical  meanings  that  will  be  discussed  in  the 
remainder  of  this  section. 

From the computer vision perspective, we consider the interpretation of $\mathbf{J}$ as an information
measure to be far more useful than its interpretation as the inverse of the theoretical
lower bound on estimation error. Our approach and purpose therefore stand apart from
that of Cernuschi-Frias et al. [4]. For example, in the type of optical pattern recognition problems
encountered with low-variance CCD camera measurements, the associated bounds on object
positional resolution fall in the sub-pixel regime, and are somewhat of an overkill. On the
other hand, because the volume $|\mathbf{J}|$ of Fisher information is inversely proportional to the limiting
mean-square resolutional volume of the parameters that uniquely specify the object,
we consider it to be a scalar measure of the object’s recognizability in a given image. By Eq.
12 it is seen that there is a direct relationship between this recognizability measure and the
physical components of the Fisher information, namely, the object’s coherence volume and
energy. For example, within a given image, where the additive noise variance is uniform,
the information volume $|\mathbf{J}|$ only varies with the object’s coherence volume and energy. The
noise variance, therefore, factors out under variations in object recognizability, regardless of the
noise level. This shows that it is the physical structure of the likelihood function and not the
level of the noise that is most important in properly formulating the recognition problem.

4.1  Position resolution


We first derive the lower bound on the error for any unbiased position estimate of an object
with known rotation, contraction and skew. Given the true position $(a_1, a_2) = (x_0, y_0)$, the
Fisher information matrix, with elements

$$J_{ij} = \frac{1}{\sigma^2}\sum_{x=0}^{M-1}\sum_{y=0}^{N-1}\frac{\partial q(x - x_0, y - y_0)}{\partial a_i}\,\frac{\partial q(x - x_0, y - y_0)}{\partial a_j}, \tag{13}$$

can be expressed by a spatial “bandwidth matrix” $\mathbf{B} = (\sigma^2/E)\,\mathbf{J}$ that characterizes the object.
To do so, it is convenient to let the double sum in Eq. 13 be replaced by a continuous double
integral so that $q(x, y)$ and $Q(u, v)$ can be defined as a Fourier transform pair


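$$Q(u,v) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} q(x,y)\,e^{-i2\pi(ux+vy)}\,dx\,dy \tag{14}$$

and

$$q(x,y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} Q(u,v)\,e^{i2\pi(ux+vy)}\,du\,dv, \tag{15}$$

where $dx\,dy = (\Delta x)^2$ is the pixel area. The four elements of $\mathbf{B}$ can then be defined by a
mean-square bandwidth $B_x^2$ in $x$,

$$B_x^2 = \frac{(2\pi)^2}{(\Delta x)^2 E}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} u^2\,|Q(u,v)|^2\,du\,dv, \tag{16}$$

a mean-square bandwidth $B_y^2$ in $y$,

$$B_y^2 = \frac{(2\pi)^2}{(\Delta x)^2 E}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} v^2\,|Q(u,v)|^2\,du\,dv, \tag{17}$$

and a cross-term

$$B_{xy}^2 = B_{yx}^2 = \frac{(2\pi)^2}{(\Delta x)^2 E}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} uv\,|Q(u,v)|^2\,du\,dv, \tag{18}$$

with the aid of Parseval’s Theorem

$$(\Delta x)^2 E = \iint_{O} |q(x,y)|^2\,dx\,dy = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} |Q(u,v)|^2\,du\,dv. \tag{19}$$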

These definitions for the object’s mean-square spatial bandwidth are similar to those introduced
for one-dimensional signal waveforms by Gabor [8]. A distinction lies in the
positive-semidefinite nature of our object brightness data versus the zero-mean nature of modulated
signal waveform data. As a result, our mean-square bandwidths are defined about zero spatial
frequency, as in Ref. [13], while those in the signal processing literature are defined about
some average frequency that approximates the carrier frequency for narrowband signals.

Given these definitions and the derivative rule for Fourier transform pairs, the lower
bound on position recognition can be expressed as

$$\mathbf{J}^{-1} = \frac{\sigma^2}{E}\,\mathbf{B}^{-1} = \frac{\sigma^2}{E}\begin{pmatrix} B_y^2 & -B_{xy}^2 \\ -B_{xy}^2 & B_x^2 \end{pmatrix} A_{x_0,y_0}^2, \tag{20}$$

where

$$A_{x_0,y_0} = |\mathbf{B}|^{-1/2} \tag{21}$$

is the coherence area of the object, which follows from Eq. 11, where $V = A_{x_0,y_0}$ for this 2-D
scenario. For example, the lower bound for estimating $x_0$ is simply

$$E[(\hat{x}_0 - x_0)^2] \ge \frac{\sigma^2}{E}\,\ell_{x_0}^2, \tag{22}$$

where the squared coherence length scale $\ell_{x_0}^2$ equals $B_y^2/|\mathbf{B}|$ or $B_y^2 A_{x_0,y_0}^2$, and the lower bound for $y_0$ is

$$E[(\hat{y}_0 - y_0)^2] \ge \frac{\sigma^2}{E}\,\ell_{y_0}^2, \tag{23}$$

where $\ell_{y_0}^2$ equals $B_x^2 A_{x_0,y_0}^2$. This analysis provides a 2-D extension of the well-known relationship
between a 1-D signal’s mean-square bandwidth and the optimal resolution attainable
in an estimate of its position [5]. While the coherence length scales $\ell_{x_0}$ and $\ell_{y_0}$ could have
been obtained directly from Eq. 10 without introducing the mean-square bandwidth concept,
this would have circumvented both the historical perspective and an important physical
interpretation.
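
The bandwidth formulation lends itself to a direct numerical sketch using the discrete Fourier transform. The code below is an illustration only: it assumes unit pixel spacing by default, replaces the continuous integrals of Eqs. 16 to 19 with sums over DFT frequencies (so the object is implicitly treated as periodic), and uses Parseval’s theorem to normalize by the total spectral energy. The function name `bandwidth_matrix` is hypothetical.

```python
import numpy as np

def bandwidth_matrix(q, dx=1.0):
    """Mean-square bandwidth matrix B of a model image q, defined about
    zero spatial frequency and normalized by the spectral energy."""
    q = np.asarray(q, dtype=float)
    M, N = q.shape                           # rows (y) and columns (x)
    Q2 = np.abs(np.fft.fft2(q)) ** 2         # |Q(u, v)|^2 on the DFT grid
    v = np.fft.fftfreq(M, d=dx)[:, None]     # frequencies along y
    u = np.fft.fftfreq(N, d=dx)[None, :]     # frequencies along x
    scale = (2.0 * np.pi) ** 2 / Q2.sum()    # Parseval normalization
    Bxx = scale * np.sum(u ** 2 * Q2)
    Byy = scale * np.sum(v ** 2 * Q2)
    Bxy = scale * np.sum(u * v * Q2)
    return np.array([[Bxx, Bxy], [Bxy, Byy]])

# Synthetic object: isotropic Gaussian blob of width s.
yy, xx = np.mgrid[0:64, 0:64]
s = 5.0
q = np.exp(-((xx - 32.0) ** 2 + (yy - 32.0) ** 2) / (2.0 * s ** 2))

B = bandwidth_matrix(q)
coherence_area = np.linalg.det(B) ** -0.5           # |B|^(-1/2)
ell_x0 = np.sqrt(B[1, 1] / np.linalg.det(B))        # sqrt(B_y^2 / |B|)
print(B, coherence_area, ell_x0)
```

For this blob the continuous theory gives $B_x^2 = B_y^2 = 1/(2s^2)$, a coherence area of $2s^2$ and a coherence length of $s\sqrt{2}$, which provides a convenient check on the discrete computation.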

The  coherence  areas  and  coherence  length  scales  of  two  traffic  signs,  a  stop  sign  and 
a  European  no-entry  sign,  are  compared  in  Figure  2.  The  stop  sign  has  a  much  smaller 
coherence  area  than  the  other  sign.  Its  position,  therefore,  can  be  resolved  much  more  easily. 

The  bound  on  position  estimation  error  is  not  invariant  to  changes  in  object  rotation,  as 
is  shown  in  Appendix  B  by  principal  component  analysis. 


[Figure 2 plots: autocorrelation R(x) versus x-coordinate of autocorrelation.]

Figure 2: Above, the images of two traffic signs and their 2-D autocorrelation surfaces are
shown. The white centers of the autocorrelation surfaces correspond to the coherence areas
of the signs. The European no-entry sign’s coherence area of 2.2% of the sign’s area is much
larger than the stop sign’s, which is 0.4%. This indicates that the position of the stop sign
can be resolved more easily than the position of the European no-entry sign. Below are 1-D
horizontal slices through the center of the signs’ autocorrelation surfaces, where y-positions
are fixed and x-positions vary. The stop sign’s horizontal position can be resolved better
than the European no-entry sign’s because of its narrower autocorrelation peak-width and
shorter coherence length.

4.2  Angular resolution


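Next, assume that only the rotation $\theta_0$ of the object about some point in the image plane is
unknown. By Eq. 10, the angular coherence scale for object rotation is

$$\ell_{\theta_0} = \left(\frac{E}{\sum_{(x,y)\in O^+}\left(\dfrac{\partial q(x,y)}{\partial\theta_0}\right)^2}\right)^{1/2}. \tag{24}$$

This leads to the bound

$$E[(\hat{\theta}_0 - \theta_0)^2] \ge \frac{\sigma^2}{E}\,\ell_{\theta_0}^2 \tag{25}$$

on angular resolution of the object, which is invariant to changes in object position, since
the derivatives of $E$ and $\ell_{\theta_0}$ with respect to $x_0$ and $y_0$ vanish, but depends on contraction
and skew of the object, since $E$ and $\ell_{\theta_0}$ are functions of $s_x$, $s_y$, and $\alpha$.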


The  angular  coherence  scales  of  a  stop  sign  and  a  European  no-entry  sign  are  compared 
in  Figure  3.  The  European  no-entry  sign  has  greater  circular  symmetry  and  therefore  has  a 
wider  angular  autocorrelation  peak  and  correspondingly  larger  angular  coherence  scale  than 
the  stop  sign. 


[Figure 3 plot: autocorrelation R(angle) versus rotation angle in degrees, from -180 to 180.]

Figure 3: Comparison of angular coherence length scales $\ell_{\theta_0}$: The European no-entry sign’s
autocorrelation peak is much wider than the stop sign’s, indicating that its rotation is more
difficult to resolve.

4.3  Contractional resolution


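Finally, assume that only the object’s contractional distortions $s_x$ and $s_y$ are unknown.
Then, for 2-D parameter vector $(a_1, a_2) = (s_x, s_y)$, where $s_x, s_y > 0$, $\mathbf{J}$ is a $2 \times 2$ matrix with
elements defined in Eq. 7. The coherence area $A_{s_x,s_y}$ and coherence length scales $\ell_{s_x}$, $\ell_{s_y}$
are then dependent, by Eq. 11, on both diagonal and cross terms of the Fisher information
matrix, such that

$$A_{s_x,s_y} = \frac{E}{\sigma^2}\left(J_{11}J_{22} - J_{12}^2\right)^{-1/2} \tag{26}$$

and

$$\ell_{s_x}^2 = \frac{E}{\sigma^2}\,\frac{J_{22}}{J_{11}J_{22} - J_{12}^2}, \qquad \ell_{s_y}^2 = \frac{E}{\sigma^2}\,\frac{J_{11}}{J_{11}J_{22} - J_{12}^2}. \tag{27}$$

The bounds for contractional resolution are then

$$E[(\hat{s}_x - s_x)^2] \ge \frac{\sigma^2}{E}\,\ell_{s_x}^2 \tag{28}$$

and

$$E[(\hat{s}_y - s_y)^2] \ge \frac{\sigma^2}{E}\,\ell_{s_y}^2. \tag{29}$$

While these bounds are invariant to changes in object position, they are invariant to changes
in object rotation only when the contractions $s_x$ and $s_y$ are equal.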

The  contractional  coherence  areas  and  scales  of  three  signs  are  compared  and  related  to 
the  respective  autocorrelation  peak  widths  in  Figure  4. 

5  The  complexity  of  imaged  objects 

According  to  standard  usage,  an  object  is  considered  to  be  complex  if  it  is  “composed  of 
elaborately  interconnected  parts.”  We  may  gather  from  this  that  as  complexity  increases  so 
does  the  number  of  interconnected  parts.  These  ideas  can  help  us  formulate  a  quantitative 
definition  for  the  complexity  of  an  imaged  object. 

Let us first consider two objects of exactly the same dimensions but of different complexities
that are imaged in an otherwise empty scene. For example, let the more complex
object be a grey-scale Mona Lisa without a picture frame, the less complex object be a blank
white canvas of the same dimensions, and the empty background be solid black. Because
of their like dimensions, the two objects occupy the same overall area. As may be inferred
from their descriptions, however, the two objects have vastly differing coherence areas. Let
us regard a coherence area as small if the ratio of it to the overall object area is much less
than 1. Then, for example, the Mona Lisa’s coherence area will be small, due to its large
number “of elaborately interconnected parts,” but the number of coherence areas or cells
that fit into the Mona Lisa’s overall area will be large. Conversely, the coherence area of the
blank canvas will not be small, but the number of coherence cells that fit into the blank
canvas’s overall area will be near unity. We may consider the overall object area as a kind
of outer scale and the coherence area as a kind of inner scale for variations in an object’s
2-D position. It is the ratio of such an outer scale to an inner scale that determines the
number of coherence cells in the object, also referred to as its degrees of freedom, which can
be interpreted as its gain in sensitivity under transformation over the empty object space.
By the foregoing argument, this ratio also serves as a quantitative measure of an object’s
complexity.

[Figure 4 plot; horizontal axis: contraction $s_x = s_y$, from 0.55 to 5.]

Figure 4: Above, the autocorrelation surfaces of model signs European no-entry, Stop and
Priority are shown with contraction parameters $s_x$ and $s_y$ increasing from the lower left to the
top right of the surfaces. The white centers of the autocorrelation surfaces are the correlation
peaks and correspond to the contractional coherence areas of the signs. The European
no-entry sign’s contractional coherence area is much greater than the Stop sign’s, which means
that the contractional parameters $s_x$ and $s_y$ are easier to resolve for the Stop sign. Below,
1-D diagonal slices of the autocorrelation surfaces are shown along the diagonal $s_x = s_y$.
Since the peak of the European no-entry sign’s autocorrelation is much wider than the stop
sign’s, the stop sign’s size is much easier to resolve than the European no-entry sign’s.

Generalizing these concepts, we define the outer volume under affine transformation,
denoted by $S$, to be the object area times $2\pi^2$. This is the product of the outer scales for
2-D positional transformation, rotation, 2-D contractions, and skew that are, respectively,
the object area $A$, $2\pi$, unity, and $\pi$. The complexity of an object under affine transformation
is then the ratio of this outer volume to the coherence volume $V$ defined in Eq. 11, so that

$$C = \frac{S}{V} = 2\pi^2 A \left(\frac{\sigma^2}{E}\right)^{3} |\mathbf{J}|^{1/2}. \tag{30}$$

The  complexities  of  various  traffic  signs  are  compared  in  Figure  5.  As  may  be  expected 
from  a  qualitative  perspective,  signs  with  inscriptions  and  human  figures  have  much  higher 
complexities  than  signs  composed  only  of  simple  geometric  shapes.  Our  data  analysis  will 
later  show  that  the  ability  to  unambiguously  resolve  such  an  object  increases  with  the  object’s 
complexity. 

When the affine transformation is reduced to a 2-D translation, the relevant positional
complexity becomes

$$C_{x_0,y_0} = \frac{A}{A_{x_0,y_0}}, \tag{31}$$

where the coherence area $A_{x_0,y_0}$ is given in Eq. 21. When the translation is restricted
to a single dimension, the above complexity becomes analogous to that used in the signal
processing literature for the analysis of complex waveforms [5, 20].

[Figure 5 plot: Complexities of Signs.]

Figure 5: Comparison of complexity $C$ for various traffic signs: Signs with inscriptions and
human figures have higher complexity than signs composed only of simple geometric shapes.
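
The positional complexity, taken as the ratio of object area to coherence area, can be illustrated with the blank-canvas thought experiment. The sketch below is an assumption-laden stand-in: a uniform square plays the blank canvas, a square filled with random brightness values plays the textured painting, derivatives are central differences, and the function names are hypothetical.

```python
import numpy as np

def coherence_area(q):
    """Positional coherence area |B|^(-1/2), with B built from gradients."""
    q = np.asarray(q, dtype=float)
    dq_dy, dq_dx = np.gradient(q)
    E = np.sum(q ** 2)                                   # object energy
    B = np.array([
        [np.sum(dq_dx * dq_dx), np.sum(dq_dx * dq_dy)],
        [np.sum(dq_dx * dq_dy), np.sum(dq_dy * dq_dy)],
    ]) / E
    return np.linalg.det(B) ** -0.5

def positional_complexity(q, object_area):
    """Ratio of the object's area to its positional coherence area."""
    return object_area / coherence_area(q)

rng = np.random.default_rng(0)
side = 48
area = side * side                      # both objects occupy the same area

blank = np.zeros((64, 64))              # black background
blank[8:8 + side, 8:8 + side] = 200.0   # uniform "blank canvas"

textured = np.zeros((64, 64))
textured[8:8 + side, 8:8 + side] = rng.uniform(0.0, 200.0, size=(side, side))

c_blank = positional_complexity(blank, area)
c_textured = positional_complexity(textured, area)
print(c_blank, c_textured)
```

The textured object yields the larger value because its brightness varies throughout its area, whereas the uniform square contributes gradient energy only along its border.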

Similarly, we define the rotational complexity of an object by

$$C_{\theta_0} = \frac{2\pi}{\ell_{\theta_0}}, \tag{32}$$

and the contractional complexity by

$$C_s = \frac{1}{A_{s_x,s_y}}, \tag{33}$$

where the rotational coherence scale $\ell_{\theta_0}$ is defined in Eq. 24 and the contractional coherence
area $A_{s_x,s_y}$ in Eq. 26. These positional, rotational, and contractional complexities of the
traffic sign models are plotted in Figure 6 and are consistent with qualitative appraisals of
the inherent positional, rotational and contractional symmetries of the signs.

6  Image  edges 

There is an important connection between the positional Fisher information of an object and
“edge-based” recognition. Both require computation of the spatial gradient $(\partial q/\partial x, \partial q/\partial y)$
of the expected object. By Eq. 13, however, the positional Fisher information integrates
gradient factors over the entire object. This includes both slowly varying brightness contributions
over the entire area of the object as well as rapid variations at edges that comprise a
relatively small fraction of the object’s overall area. A priori, there is no way to judge which
of these will make the dominant contribution to the Fisher information. In spite of this basic
fact, edge-based recognition methods threshold the gradient magnitude over the object so as
to discard all information pertinent to the object’s recognizability that is not contained in its
edges. The danger in edge-based methods, therefore, is that a potentially larger amount of
information may come from slowly varying brightness changes accumulated throughout the
object’s area than from rapid changes at edges. In this case, edge-based recognition methods
are inherently sub-optimal. Conversely, if the predominant positional information about an
object is concentrated in its edges, the analysis of Fisher information, coherence scales and
complexity remains equally pertinent regardless of the method of recognition. Moreover,
the foregoing analysis goes beyond consideration of positional variations, as expressed in
terms of the horizontal and vertical gradient components also used in edge methods, and
also accounts for the general linear variations permissible in an affine transformation.

Figure 6: The positional, rotational, and contractional complexities of the traffic sign models.
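
The loss incurred by gradient thresholding can be quantified in a simple way. The sketch below is an illustration under assumptions, not an analysis of any particular edge detector: it measures the fraction of the trace of the positional Fisher information (the summed squared gradient) that survives when only pixels with gradient magnitude above a threshold are kept. The function name and the synthetic smooth object are hypothetical.

```python
import numpy as np

def retained_position_information(q, threshold):
    """Fraction of the summed squared gradient (trace of the positional
    Fisher information, up to 1/sigma^2) kept after thresholding."""
    q = np.asarray(q, dtype=float)
    dq_dy, dq_dx = np.gradient(q)
    grad_sq = dq_dx ** 2 + dq_dy ** 2
    edge_mask = np.sqrt(grad_sq) >= threshold    # pixels labeled as "edges"
    return grad_sq[edge_mask].sum() / grad_sq.sum()

# Synthetic object with slowly varying brightness and no sharp edges.
yy, xx = np.mgrid[0:64, 0:64]
q = 200.0 * np.exp(-((xx - 32.0) ** 2 + (yy - 32.0) ** 2) / (2.0 * 5.0 ** 2))

dq_dy, dq_dx = np.gradient(q)
max_grad = np.sqrt(dq_dx ** 2 + dq_dy ** 2).max()

frac_all = retained_position_information(q, 0.0)
frac_half = retained_position_information(q, 0.5 * max_grad)
print(frac_all, frac_half)
```

A threshold of zero keeps all of the information, while any positive threshold discards the contribution of the slowly varying regions; how large that discarded share is depends entirely on the object.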

Figure  7  illustrates  the  similarities  and  differences  between  a  model  sign  with  information 
concentrated  at  distinct  edges,  its  derivatives  with  respect  to  various  affine  parameters,  and 
its  edge  maps. 


Figure 7: A stop sign, its partial derivatives with respect to $x$, $y$, $\theta$, and $s = s_x = s_y$, and
its edge map and thresholded edge map.


7  Maximum  likelihood  estimation  of  an  object  in  a  scene  image 

In  this  section,  we  derive  a  method  of  recognizing  an  object  in  a  complex  scene  that  attains 
the  theoretical  lower  bound  on  mean-square  error  for  any  unbiased  estimate,  and  therefore 
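is by definition statistically optimal and information-conserving.

Given the image data $\mathbf{I}$, and following classical estimation theory, we use the likelihood
function of Eq. 4 to derive the maximum likelihood estimate

$$\hat{\mathbf{a}}_{ML} = \arg\max_{\mathbf{a}} P(\mathbf{I}|\mathbf{a}) \tag{34}$$

of the parameters $\mathbf{a}$. The maximum likelihood estimate $\hat{\mathbf{a}}_{ML}$ can be found by solving the
likelihood equation

$$\left.\frac{\partial \ln P(\mathbf{I}|\mathbf{a})}{\partial \mathbf{a}}\right|_{\mathbf{a}=\hat{\mathbf{a}}_{ML}} = 0. \tag{35}$$

Since

$$\frac{\partial \ln P(\mathbf{I}|\mathbf{a})}{\partial \mathbf{a}} = -\frac{\partial}{\partial \mathbf{a}}\left(\frac{1}{2\sigma^2}\sum_{k=1}^{MN}(I_k - m_k(\mathbf{a}))^2\right), \tag{36}$$

the maximum likelihood estimate $\hat{\mathbf{a}}_{ML}$ is equal to

$$\arg\min_{\mathbf{a}} \frac{1}{2\sigma^2}\left(\sum_{(x,y)\in B}(I(x,y) - m(x,y))^2 + \sum_{(x,y)\in O^+}(I(x,y) - q(x,y;\mathbf{a}))^2\right), \tag{37}$$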


where region $B$ consists of background unrelated to the object, while region $O^+$ is the union
of all pixels that contain the expected object $q(x,y;\mathbf{a})$ as well as a slightly perturbed or
variational object $q(x,y;\mathbf{a}+\Delta\mathbf{a})$. The first sum in Eq. 37 can be discarded, because
the background does not depend on the object properties described by parameter $\mathbf{a}$. The
maximum likelihood estimate is then

$$\hat{\mathbf{a}}_{ML} = \arg\min_{\mathbf{a}} \sum_{(x,y)\in O^+}\big(I(x,y) - q(x,y;\mathbf{a})\big)^2. \tag{38}$$

After expanding the square, this reduces to

$$\hat{\mathbf{a}}_{ML} = \arg\max_{\mathbf{a}} \sum_{(x,y)\in O^+} I(x,y)\,q(x,y;\mathbf{a}), \tag{39}$$

because the data energy $\sum_{(x,y)\in O^+}\big(I(x,y)\big)^2$ is always independent of $\mathbf{a}$, and, for small
perturbations of $\mathbf{a}$ about its true value, the expected object energy $\sum_{(x,y)\in O^+}\big(q(x,y;\mathbf{a})\big)^2$ can be
taken as a constant independent of $\mathbf{a}$.

We interpret $q(x,y;\mathbf{a})$, in Eq. 39, as a multidimensional matched filter, which, when
evaluated at some particular $\mathbf{a}_0$, is referred to as the replica $q(x,y;\mathbf{a}_0)$. To find the maximum
likelihood estimate $\hat{\mathbf{a}}_{ML}$, therefore, we search for the replica that best matches the object
in the image. The value of the parameter vector that corresponds to this best match is the
maximum likelihood estimate. To put it another way, we seek the value of the parameter
vector that maximizes the output of the multidimensional matched filter given by the sum
in Eq. 39.

To ensure that our filter is not biased by changes in either the data energy or the expected
object energy within the local replica window, we employ a local weighting. The output of
the resulting weighted multidimensional matched filter

$$r(\mathbf{a}) = \frac{1}{\sigma_I(\mathbf{a})\,\sigma_q(\mathbf{a})}\left(A(\mathbf{a})\sum_{(x,y)\in O} I_q(x,y)\,q(x,y;\mathbf{a}) - m_I(\mathbf{a})\,m_q(\mathbf{a})\right) \tag{40}$$

quantifies how well the measured data in subimage $I_q(x,y)$ matches the replica object
in $q(x,y;\mathbf{a})$. Here $A(\mathbf{a})$ is the number of pixels in the replica image $q(x,y;\mathbf{a})$ that have
nonzero brightness, and therefore constitute the replica object, while $O$ is the region that
contains the replica object, as illustrated in Figure 8. The local variance of subimage $I_q(x,y)$
is

$$\sigma_I^2(\mathbf{a}) = A(\mathbf{a})\sum_{(x,y)\in O} I_q(x,y)^2 - \Big(\sum_{(x,y)\in O} I_q(x,y)\Big)^2,$$

the variance of replica image $q(x,y;\mathbf{a})$ is

$$\sigma_q^2(\mathbf{a}) = A(\mathbf{a})\sum_{(x,y)\in O} q(x,y;\mathbf{a})^2 - \Big(\sum_{(x,y)\in O} q(x,y;\mathbf{a})\Big)^2,$$

and the local image means are $m_I(\mathbf{a}) = \sum_{(x,y)\in O} I_q(x,y)$ and $m_q(\mathbf{a}) = \sum_{(x,y)\in O} q(x,y;\mathbf{a})$.
It is noteworthy that the weighted multidimensional matched filter is dimensionless, with
$|r(\mathbf{a})| \le 1$ by the Cauchy-Schwarz inequality, so that scene object $I_q$ and replica object $q$
are perfectly correlated when $r(\mathbf{a}) = 1$.
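
The weighted matched filter of Eq. 40 can be sketched in a few lines of code. The following is a minimal illustration, not the implementation used in the experiments: the function names `weighted_matched_filter` and `best_translation` are hypothetical, images are assumed to be plain 2-D lists of brightness values, and the search covers translations only, with rotation and scale held fixed.

```python
import math


def weighted_matched_filter(sub, replica):
    """Evaluate Eq. 40 over the pixels where the replica is nonzero.

    sub and replica are equally sized 2-D lists of brightness values.
    Returns r in [-1, 1], or 0.0 if either variance term vanishes.
    """
    pairs = [(i, q)
             for row_i, row_q in zip(sub, replica)
             for i, q in zip(row_i, row_q)
             if q != 0]                       # region O: nonzero replica pixels
    A = len(pairs)                            # A(a): number of object pixels
    s_i = sum(i for i, _ in pairs)            # m_I(a)
    s_q = sum(q for _, q in pairs)            # m_q(a)
    s_iq = sum(i * q for i, q in pairs)
    var_i = A * sum(i * i for i, _ in pairs) - s_i ** 2   # sigma_I^2(a)
    var_q = A * sum(q * q for _, q in pairs) - s_q ** 2   # sigma_q^2(a)
    if var_i <= 0 or var_q <= 0:
        return 0.0                            # flat patch: no usable contrast
    return (A * s_iq - s_i * s_q) / math.sqrt(var_i * var_q)


def best_translation(scene, replica):
    """Exhaustive search over translations; returns (r, (x0, y0))."""
    h, w = len(replica), len(replica[0])
    best = (-2.0, None)
    for y0 in range(len(scene) - h + 1):
        for x0 in range(len(scene[0]) - w + 1):
            sub = [row[x0:x0 + w] for row in scene[y0:y0 + h]]
            r = weighted_matched_filter(sub, replica)
            if r > best[0]:
                best = (r, (x0, y0))
    return best
```

Here `(x0, y0)` is the top-left corner of the window, a convention chosen for this sketch. Returning 0.0 for a flat patch is likewise a choice made here to avoid division by zero; the source does not specify how that case is handled.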

When the estimate $\mathbf{a}$ is very close to its true value, small changes in $\mathbf{a}$ lead to negligible
changes in $m_I$, $m_q$, $\sigma_I$, and $\sigma_q$, so that these sample means and standard deviations may be
taken as locally constant. In this case, the weighted matched filter of Eq. 40 becomes a
linear function of the matched filter, the sum in Eq. 39, as demonstrated experimentally in
Figure 9, so that the value of $\mathbf{a}$ that maximizes $r(\mathbf{a})$ is the maximum likelihood estimate,
where

$$\hat{\mathbf{a}}_{ML} = \arg\max_{\mathbf{a}} r(\mathbf{a}). \tag{41}$$

Figure 8: Scene image $I(x,y)$ with subimage $I_q(x,y)$ and replica image $q(x,y)$. Since the
replica object may not be exactly rectangular, the portion of $I_q(x,y)$ in the scene image that
does not overlap the object replica must be removed from the match. To do so, the $m \times n$
replica image $q(x,y)$ is set to zero for pixels not belonging to the replica object. The
computation time for any value of $r$ is proportional to the number of nonzero pixels $A$ in
the object, which is usually much smaller than the number of pixels in $I$.

Therefore, while there may be local optima in the weighted matched filter output, the
location of its global optimum in the parameter search space corresponds to the maximum
likelihood estimate. The maximum likelihood estimate, and hence the weighted matched
filter, asymptotically attains the lower bound derived in Eq. 12, and is therefore optimal
and information-preserving when the signal-to-noise ratio (SNR) $E/\sigma^2$ is high, as it is for
the present recognition problem with CCD data. An analytic proof of this can be found in
Ref. [14].

However,  a  more  practical  experimental  proof  of  our  method’s  statistical  optimality  is 
readily  provided  by  inspection  of  Figure  9.  Over  the  entire  global  peak,  the  weighted  matched 
filter,  a  comparison  of  noiseless  object  replicas  versus  noisy  image  data,  is  indistinguishable 
from  the  autocorrelation  of  the  coincident  noiseless  object  replicas.  While  such  a  result  is 
not  surprising,  nor  does  it  make  the  corresponding  sub-pixel  positional  error  bounds  more 
relevant,  it  nonetheless  confirms  that  our  approach  is  information-conserving,  and  therefore 
takes  full  advantage  of  all  the  image  data  pertinent  to  the  object’s  recognizability. 

Figure 9: Above, a scene image with a “one way” sign. In the middle, three ambiguity
surfaces computed for all possible translations of the “one way” sign replica with fixed angle
and scaling parameters. The left surface is computed using the matched (M) filter (Eq. 39),
the middle surface is computed using the weighted matched (WM) filter (Eq. 40), and the
right surface is the autocorrelation. The correlation peak of the surfaces is a white spot
located in the upper left of each plot. Below, horizontal and vertical slices through the
global peaks of the ambiguity surfaces. The left graph shows slices along the $x$-axis of the
ambiguity surfaces with the $y$-coordinate fixed, and the right graph shows slices along the
$y$-axis with the $x$-coordinate fixed. The methods converge at the true solution.

8 Brightness invariance of flat surfaces


The brightness of an object depends on its reflectance properties, its shape, and its
illumination. In particular, the scene radiance $L$ of a surface patch centered at world point
$(X, Y, Z)$ is proportional to the image irradiance or intensity $W$ measured at the
corresponding pixel $(x, y)$, such that

$$W(x,y) = g\,L(X,Y,Z), \tag{42}$$

where $g$ is a function of parameters of the imaging system [10]. Since the sensitivity of our
imaging system is uniform over the whole image, we can assume that $g$ is constant. The
imaging scenario is illustrated in Figure 10.


Figure 10: The image irradiance $I(x,y)$ is a function of the corresponding scene radiance
$L(X,Y,Z)$, which depends on the source direction $\mathbf{s}$, the viewer direction $\mathbf{v}$, the surface
normal $\mathbf{n}$, and the scene irradiance $E_i$.


The scene radiance is related to the object’s bidirectional reflectance distribution function
(BRDF) $f_r$ and the source irradiance $E_i$ by

$$L_r(X,Y,Z) = f_r\big(\mathbf{s}(X,Y,Z), \mathbf{v}(X,Y,Z), X, Y, Z\big)\,E_i\big(\mathbf{s}(X,Y,Z)\big), \tag{43}$$
where $\mathbf{s}(X,Y,Z)$ is the direction of a collimated light source, and $\mathbf{v}(X,Y,Z)$ is the direction
of the camera. For a flat surface, however, the direction of the collimated source is constant
over the object such that $\mathbf{s} = \mathbf{s}(X,Y,Z)$. Under the benign assumption that the object’s
reflectance has directional properties that are separable from its spatial properties, we have

$$f_r\big(\mathbf{s}, \mathbf{v}(X,Y,Z), X, Y, Z\big) = f_{r1}\big(\mathbf{s}, \mathbf{v}(X,Y,Z)\big)\,\rho(X,Y,Z). \tag{44}$$

A special case of this is a Lambertian surface, where $f_{r1}(\mathbf{s},\mathbf{v}) = 1/\pi$ and $\rho(X,Y,Z)$ is the
albedo. If the camera is at least a few object lengths away, then its directional variations
over the object will be so small that the camera’s direction can be considered constant such
that $\mathbf{v} = \mathbf{v}(X,Y,Z)$. Then the image brightness $I = W^{\gamma}$ becomes

$$I = c_\gamma\,\rho^{\gamma}(X,Y,Z), \tag{45}$$

which, to within the constant factor

$$c_\gamma = \big(g\,f_{r1}(\mathbf{s},\mathbf{v})\,E_i(\mathbf{s})\big)^{\gamma}, \tag{46}$$

is invariant to changes in the geometry of the source, receiver and object. It is noteworthy
that in the case of a Lambertian surface, the above result is valid regardless of whether
$\mathbf{v}(X,Y,Z)$ is effectively constant or not. This is significant because many real-world surfaces
exhibit Lambertian behavior, and traffic signs in particular are designed to have Lambertian
reflectance properties.

By distributivity, these results are easily extended to a hemispherical distribution of
distant sources, such as the sky, so that the image brightness of the flat object remains
invariant to changes in the geometry of the source, receiver and object to within the constant
factor $c_\gamma$.

9  Recognition  of  flat  objects 

The output of the weighted matched filter, given in Eq. 40, is invariant to linear
transformations of image brightness of the form

$$I'(x,y) = c_1 I(x,y) + c_2, \tag{47}$$

where $c_1$ and $c_2$ are scalar constants, as shown in Appendix C. But the analysis of the
previous section shows that, to within a scalar factor, the image brightness of a flat object
remains invariant to changes in scene shading brought about by changes in the geometry of
the source, receiver and object. The output of our weighted matched filter, therefore, is
invariant to such changes in scene shading, as is our optimal estimate of the parameters $\mathbf{a}$,
as is necessary for object recognition.
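
The invariance under Eq. 47 can be checked directly from the definitions of Section 7. The short derivation below is a sketch rather than the argument of Appendix C; it assumes $c_1 > 0$, writes $I$ for the subimage $I_q$, takes all sums over $(x,y) \in O$, and uses $\sum 1 = A$.

```latex
\begin{align*}
A\sum I'q - \sum I'\sum q
  &= A\sum (c_1 I + c_2)\,q - \Big(c_1\sum I + c_2 A\Big)\sum q
   = c_1\Big(A\sum Iq - \sum I\sum q\Big),\\
\sigma_{I'}^2 = A\sum I'^2 - \Big(\sum I'\Big)^2
  &= c_1^2\Big(A\sum I^2 - \big(\textstyle\sum I\big)^2\Big) = c_1^2\,\sigma_I^2,\\
r' = \frac{c_1\big(A\sum Iq - \sum I\sum q\big)}{|c_1|\,\sigma_I\,\sigma_q}
  &= \operatorname{sign}(c_1)\,r = r \quad\text{for } c_1 > 0 .
\end{align*}
```

The offset $c_2$ cancels in both the numerator and the variance, and the gain $c_1$ cancels in the ratio, so only a negative gain (a contrast reversal) would change the filter output, and then only in sign.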

Uncooperative conditions such as strong shadows and occlusion can change the image
irradiance nonuniformly and will cause problems with recognition if they are not accounted
for. However, this is not an exclusive weakness of our present formulation, since any
approach, for example, systems based on contour or edge detection, will have difficulties in
such unpredictable and adverse situations.

10  The  traffic  sign  recognition  system 

Our method’s performance has been evaluated experimentally by applying it to the problem
of recognizing traffic signs. This application is very valuable for intelligent vehicles, which
can use the recognition results to adjust their speeds or localize themselves in their
environments [2]. There are several conference papers on traffic sign recognition, for example, Refs.
[1, 6, 11, 17, 19, 25]. Our first results were published in Ref. [3]. Our method stands apart
from previous approaches because it is not restricted to edge detection, as are Refs. [1, 6, 17],
and it does not rely on color information, as do Refs. [11, 19, 25]. In principle, our approach
could be extended by parameterizing color information.

An  overview  of  our  traffic  sign  recognition  system  is  shown  in  Figure  11.  The  system  has 
a  library  of  replica  models,  one  for  each  traffic  sign  class,  which  are  input  along  with  a  scene 
image.  It  outputs  a  description  of  the  recognized  traffic  sign  in  the  scene  image  or  concludes 
that  the  scene  does  not  contain  a  traffic  sign.  The  system  contains  three  components:  a 
replica  generator  discussed  in  Section  11,  a  weighted  matched  filter  discussed  in  Section  7, 
and  a  parameter  perturbation  component  discussed  in  Section  12. 

The recognition process starts by choosing an arbitrary model class $\nu$ and a random initial
parameter vector $\mathbf{a}$. The replica generator uses the initial guess of $\mathbf{a}$ to transform
the model image into replica image $q(x,y;\mathbf{a},\nu)$. This replica is then used by the weighted
matched filter $r(\mathbf{a})$ to quantitatively evaluate the match. If the match is poor, meaning it does
not satisfy a predetermined threshold, the parameter vector $\mathbf{a}$ is perturbed using simulated
annealing, a standard nonlinear optimization method. With the perturbed parameter vector,
a new replica is created and tested.

The process of perturbing and evaluating $\mathbf{a}$ is iterated for a fixed amount of time and
then repeated for all sign classes. If the best matching replica image $q(x,y;\mathbf{a}^*,\nu^*)$ among all
parameters $\mathbf{a}$ and classes $\nu$ correlates highly with the scene image, meaning it surpasses a
predetermined threshold, it describes the recognized object and is the output of the system.
Otherwise, the system outputs that no traffic sign was found in the scene.


Figure 11: The traffic sign recognition system. Its inputs are the models and the scene image.


11  Generating  replicas  from  model  images 

For efficiency reasons, we use five parameters $(x_0, y_0, \theta_0, s_x, s_y)$ to approximate the affine
transformation defined in Eq. 3. The skew parameter $\alpha$ is not varied, but set to zero. This
is a valid approximation of transformations that traffic sign images undergo, because the
signs are generally fronto-parallel to the image plane or tilted by not more than 45° and are
far away from the camera compared to their sizes.

Figure 12 shows how a replica image is generated from a model by subsampling. In this
example, the parameters are chosen so that the replica consists only of the circled pixels of
the model and additional zero-brightness pixels. For other parameter choices, a four-point
interpolation may be necessary to compute the brightness for the replica. Examples of other
replicas are shown in Figure 13. Our method computes the replica very quickly by sweeping
over the model image only once. The time for creating a replica image from an $n \times m$ model
is $O(nm)$.
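
The sampling idea can be sketched as follows. This is an illustrative variant, not the single-sweep procedure described above: it loops over replica pixels and maps each one back into the model (inverse mapping), using four-point (bilinear) interpolation. The function name `make_replica` and the parameters `step_x`, `step_y` (model pixels advanced per replica pixel) are hypothetical stand-ins for the contraction parameters.

```python
import math


def make_replica(model, out_h, out_w, step_x, step_y, theta_deg):
    """Sample `model` (a 2-D list) on a rotated, contracted grid.

    step_x, step_y: model pixels advanced per replica pixel (assumed
    convention); theta_deg: rotation angle about the model center.
    Replica pixels that map outside the model get zero brightness.
    """
    n, m = len(model), len(model[0])
    cy, cx = (n - 1) / 2.0, (m - 1) / 2.0            # model center
    oy, ox = (out_h - 1) / 2.0, (out_w - 1) / 2.0    # replica center
    c = math.cos(math.radians(theta_deg))
    s = math.sin(math.radians(theta_deg))
    replica = [[0.0] * out_w for _ in range(out_h)]
    for v in range(out_h):
        for u in range(out_w):
            # replica offset -> scaled -> rotated -> model coordinates
            dx, dy = (u - ox) * step_x, (v - oy) * step_y
            x = cx + c * dx - s * dy
            y = cy + s * dx + c * dy
            if x < 0 or y < 0 or x > m - 1 or y > n - 1:
                continue                              # stays zero
            x0, y0 = math.floor(x), math.floor(y)
            x1, y1 = min(x0 + 1, m - 1), min(y0 + 1, n - 1)
            fx, fy = x - x0, y - y0
            # four-point (bilinear) interpolation between neighbors
            top = (1 - fx) * model[y0][x0] + fx * model[y0][x1]
            bot = (1 - fx) * model[y1][x0] + fx * model[y1][x1]
            replica[v][u] = (1 - fy) * top + fy * bot
    return replica
```

For example, `make_replica(model, 5, 5, 2, 2, 0)` applied to a 9 x 9 model picks every second model pixel, and no interpolation is needed because every sample falls exactly on a model pixel.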


Figure 12: A 5 × 5 replica image is obtained from a 9 × 9 model image using contraction
parameters $s_x = s_y = 2$ and rotation parameter $\theta_0 = 45°$. For the transformation, the pixels
along diagonal vectors $t_x$ and $t_y$ are rotated by 45° and become aligned with the coordinate
system of the replica. Similarly, the pixels along vectors $m_x$ and $m_y$ are rotated by 45° and
become aligned with the diagonals of the replica. Pixels of zero brightness are added in the
replica image where necessary.


12  The  simulated  annealing  algorithm 

Since the space of possible solutions of the recognition problem is extremely large, an
exhaustive search of this space is computationally too expensive. We therefore base our recognition
method on simulated annealing, a popular search technique for solving nonlinear
optimization problems [15, 12], which has been applied to many computer vision problems, e.g., [7].
Its name originates from the process of slowly cooling molecules to form a perfect crystal.
The analogue to this cooling process is an iterative search process, controlled by a decreasing

Figure 13: Six replica images of a “slow” sign, obtained by sampling the “slow” sign model
at various sampling rates and degrees of rotation.


“temperature” parameter. At each iteration $j$, the algorithm generates a replica $q(x,y;\mathbf{a},\nu)$
as described in Section 11. A new test value $a^{(j)}_{test}$ for parameter $a$ at iteration $j$ is created by

$$a^{(j)}_{test} = a^{(j-1)} + \Delta a^{(j)}, \tag{48}$$

where $a^{(j-1)}$ is the previous value of $a$ and step $\Delta a^{(j)}$ is a random variable that is uniformly
distributed within some interval $[-\Delta, \Delta]$. The step bound $\Delta$ is determined experimentally.
To properly deal with image boundaries of a scene image, the $j$-th test value for the center
$(x_0, y_0)$ of a replica of width $w_x$ and height $w_y$ is computed by

$$\begin{aligned}
x^{(j)}_{0,test} &= \big(x_0^{(j-1)} + \Delta x_0^{(j)} - w_x^{(j-1)}\big) \bmod \big(m_I - w_x^{(j-1)}\big) + w_x^{(j-1)},\\
y^{(j)}_{0,test} &= \big(y_0^{(j-1)} + \Delta y_0^{(j)} - w_y^{(j-1)}\big) \bmod \big(n_I - w_y^{(j-1)}\big) + w_y^{(j-1)}.
\end{aligned} \tag{49}$$

This definition avoids “attracting” the replica to the rim or corners of the scene image during
the search.

At each iteration, the test values for the rotation and contraction parameters are used
to create a replica image and correlate it with the scene image at the test location. If
the weighted matched filter output $r^{(j)}_{test}$ increases over the previous output $r^{(j-1)}$, the test
parameter values are accepted since a better match is found, i.e.,

$$\text{if } r^{(j)}_{test} > r^{(j-1)} \text{ then } a^{(j)} := a^{(j)}_{test} \tag{50}$$

for each parameter $a$. If the current match is worse than the previous match, i.e., $r^{(j)}_{test} < r^{(j-1)}$,
the test values are accepted if

$$\exp\left(-\frac{r^{(j-1)} - r^{(j)}_{test}}{T^{(j)}}\right) > \xi, \tag{51}$$

where $\xi$ is randomly chosen to be in $[0, 1]$, $T^{(j)}$ is the temperature parameter in the $j$-th
iteration, and the negative exponent corresponds to the Boltzmann distribution for thermal
equilibrium. For a sufficiently high temperature, this allows “jumps” out of local maxima. The
cooling schedule for the $j$-th update of the temperature parameter is

$$T^{(j)} = T_0/j \quad \text{for } 1 \le j \le L, \tag{52}$$

where $T_0$ is the initial temperature and $L$ is the number of iterations during the search.
Equation 52 describes the fast-converging inverse linear cooling schedule [23]. See Ref. [22]
for a thorough comparison of annealing algorithms with finite-length cooling schedules. Since,
after $L$ iterations, the search may not have yielded the optimal solution, a local exhaustive
search is conducted around the best solution found. The best result of the local search
among all classes describes the recognized sign, as long as it has a filter output that lies
above threshold $\delta$. The pseudo code of the recognition algorithm is shown in Figure 14. The
behavior of the parameters during a typical run of the algorithm is shown in Figure 15.
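
The acceptance rule and cooling schedule of Eqs. 48 and 50 to 52 can be sketched generically as follows. This is a simplified illustration under stated assumptions, not the algorithm of Figure 14: the function name `anneal` and its arguments are hypothetical, all parameters are perturbed together in one step rather than position and shape in two separate steps, test values are clamped to their domains instead of being wrapped as in Eq. 49, and the final local exhaustive search is omitted.

```python
import math
import random


def anneal(score, a0, step_bounds, domains, T0, L, rng=None):
    """Maximize score(a) using the acceptance rule of Eq. 51 with T(j) = T0/j.

    a0: initial parameter list; step_bounds: one bound per parameter;
    domains: one (low, high) pair per parameter.
    Returns the best parameter list visited and its score.
    """
    rng = rng or random.Random()
    a = list(a0)
    r = score(a)
    best_a, best_r = list(a), r
    for j in range(1, L + 1):
        T = T0 / j                                    # Eq. 52
        # Eq. 48: uniform step in [-bound, bound], clamped to the domain
        test = [min(hi, max(lo, x + rng.uniform(-d, d)))
                for x, d, (lo, hi) in zip(a, step_bounds, domains)]
        r_test = score(test)
        # Eqs. 50-51: an improvement gives exp(0) = 1 > xi, so it always
        # passes; a worse match passes with probability exp(-(r - r_test)/T)
        if math.exp(min(0.0, (r_test - r) / T)) > rng.random():
            a, r = test, r_test
            if r > best_r:
                best_a, best_r = list(a), r
    return best_a, best_r
```

In the recognition setting, `score` would wrap replica generation and the weighted matched filter of Eq. 40, and `anneal` would be called once per model class. Capping the exponent at zero only guards against floating-point overflow and does not change the rule.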

13  Experimental  results 

Our  data  consists  of  more  than  3280  scene  images,  a  few  of  which  are  shown  in  Figure  16. 
The  main  criterion  for  the  selection  of  the  scene  images  is  to  obtain  a  wide  variety  of  traffic 
sign  scenes,  originating  from  both  the  U.S.  and  Europe.  The  signs  in  the  scenes  have  different 
sizes  and  orientations,  are  illuminated  differently,  and  have  various  backgrounds.  Some  traffic 
signs  are  aged  and  bent,  some  are  painted  with  graffiti.  The  model  images  used  to  represent 
the  traffic  sign  classes  are  shown  in  Figure  19. 

Recognition Algorithm (Scene image $I$, set of models $M$)

 1.  Initialize recognition threshold $\delta$.
 2.  Initialize parameter domains.
 3.  Initialize step bounds $\Delta_{x_0}$, $\Delta_{y_0}$, $\Delta_{s_x}$, $\Delta_{s_y}$, and $\Delta_{\theta}$.
 4.  For each model class $\nu \in M$ do
 5.      Initialize position, contraction, and rotation of replica, e.g., at random.
 6.      Initialize search length $L$ and temperature $T_0$.
 7.      For $j = 1$ to $L$ do
 8.          Update temperature $T^{(j)} := T_0/j$.
 9.          Pick $x^{(j)}_{0,test}$ and $y^{(j)}_{0,test}$ randomly according to Eqs. 48 and 49.
10.          Evaluate correlation $r^{(j)}_{test}$ as a function of
11.              $x^{(j)}_{0,test}$, $y^{(j)}_{0,test}$, $s_x^{(j-1)}$, $s_y^{(j-1)}$ and $\theta_0^{(j-1)}$ according to Eq. 40.
12.          Choose $\xi$ uniformly at random within $[0, 1]$.
13.          If $\exp(-(r^{(j-1)} - r^{(j)}_{test})/T^{(j)}) > \xi$
14.              then $x_0^{(j)} := x^{(j)}_{0,test}$, $y_0^{(j)} := y^{(j)}_{0,test}$, $r^{(j)} := r^{(j)}_{test}$, update best replica $q^*$
15.              else $x_0^{(j)} := x_0^{(j-1)}$, $y_0^{(j)} := y_0^{(j-1)}$, $r^{(j)} := r^{(j-1)}$.
16.          Pick $s^{(j)}_{x,test}$, $s^{(j)}_{y,test}$ and $\theta^{(j)}_{0,test}$ randomly according to Eq. 48.
17.          Create new replica with contractions $s^{(j)}_{x,test}$, $s^{(j)}_{y,test}$ and rotation $\theta^{(j)}_{0,test}$.
18.          Evaluate correlation $r^{(j)}_{test}$ as a function of
19.              $x_0^{(j)}$, $y_0^{(j)}$, $s^{(j)}_{x,test}$, $s^{(j)}_{y,test}$ and $\theta^{(j)}_{0,test}$ according to Eq. 40.
20.          Choose $\xi$ uniformly at random within $[0, 1]$.
21.          If $\exp(-(r^{(j)} - r^{(j)}_{test})/T^{(j)}) > \xi$
22.              then $s_x^{(j)} := s^{(j)}_{x,test}$, $s_y^{(j)} := s^{(j)}_{y,test}$, $\theta_0^{(j)} := \theta^{(j)}_{0,test}$, $r^{(j)} := r^{(j)}_{test}$,
23.              update best replica $q^*$.
24.      Optimize $q^*$ by small local parameter perturbations.
25.  Determine replica $q^*$ with highest correlation among all $q^*$.
26.  If correlation for $q^*$ < threshold $\delta$
27.      then output “No traffic sign found in image $I$.”
28.      else output “Traffic sign $q(x, y; x_0^*, y_0^*, \theta_0^*, s_x^*, s_y^*, \nu^*)$ found.”

Figure 14: The recognition algorithm.

Figure 15: Five graphs illustrating the behavior of the filter output $r$ and the parameters
$x_0$, $y_0$, $\theta_0$ and $s_y$ during a typical run of the simulated annealing algorithm with a stop sign
scene as input. The algorithm is run with the initial parameters reported in Figure 18.
The size of the search space is ca. 60 million. The algorithm takes ca. 18 min. on a Sun
SPARCstation 5. The stop sign is almost recognized at iteration 2802, but the temperature
parameter is still too high for convergence. The sign is finally recognized at iteration 4050,
after which the parameters are only slightly adjusted.

Figure 16: Some of the images used in the recognition experiments.

The performance of our traffic sign recognition system depends on the complexities of
the signs in the images. The system recognizes 94% of the traffic signs correctly and
misclassifies 6%, provided that the signs are no smaller than about 8% of the scene’s area. These
numbers discount mismatches of European signs with their corresponding American signs.
For example, replicas of European yield signs do not have any inscriptions, but correlate
highly with scenes of American signs with the inscription “yield.” Figure 17 shows some
recognition results, including scenes with occluded signs and with several traffic signs. (The
pseudo-code in Figure 14 is easily modified in line 25 so that several signs in a scene can be
found.) The initial parameter values used in the simulated annealing algorithm are listed in
Figure 18.

overlying the circled sign in the image.

Recognition threshold:      $\delta = 0.6$
Search length:              $L = 7000$
Initial temperature:        $T_0 = 210$
Initial position:           $(x_0^{(0)}, y_0^{(0)})$ = center of scene
Initial rotation:           $\theta_0^{(0)} = -4°$
Initial contraction:        replica of size 70 × 70 pixels
Location domain:            $(x_0^{(j)}, y_0^{(j)})$ anywhere in scene
Rotation domain:            $\theta_0^{(j)} \in [-10°, 10°]$
Contraction domain:         $m \times n$ replica with $m, n \in [36 : 140]$ pixels
Step bounds:                $\Delta_{x_0} = \Delta_{y_0} = 50$ pixels, $\Delta_{\theta_0} = 5°$,
                            contraction bounds allow ±10 pixel steps
Local exhaustive search:    position shift by ±2 pixels,
                            replica contraction by ±2 pixels,
                            replica rotation by ±1°

Figure 18: Initial parameter values for annealing algorithm.

Figure  19  shows  the  average  best  correlation  values  obtained  for  each  model  sign  in  the 
experiments.  The  top  graph  illustrates  the  average  best  correlation  for  correct  matches.  For 
example,  replicas  created  from  the  footpath  sign  model  match  correctly  with  scene  images 
that  contain  footpath  signs  with  an  average  correlation  of  0.78.  An  average  of  408  scene 
images  per  traffic  sign  was  used,  except  for  the  rare  “slow”  sign  for  which  only  a  few  images 
could  be  obtained.  Note  for  comparison  that  the  average  correlation  for  a  replica  that 
matches  with  an  arbitrary  scene  is  zero,  E[r]  =  0,  while  a  perfect  match  yields  r  =  1.  The 
dashed  graph  in  Figure  19  illustrates  the  average  best  correlation  for  scene  images  that  do 
not  contain  a  traffic  sign  in  the  correct  class  or  do  not  contain  a  traffic  sign  at  all.  For 
example,  the  best  correlation  of  a  stop  sign  replica  with  an  arbitrary  image  that  does  not 
contain  a  stop  sign  is  0.36  on  average.  The  average  is  taken  over  about  2200  scene  images 
per  sign  class. 

False  positive  matches  occur  when  the  best  correlating  replica  is  not  in  the  correct  class. 
Figure  20  shows  an  example  of  a  false  positive  match  where  the  no-entry  sign  to  be  recognized 
is  covered  by  graffiti  and  occluded  by  a  nonuniform  shadow.  Although  the  sign  in  the  scene 
can  be  found,  as  shown  in  the  left  image  in  Figure  20,  the  corresponding  match  yields  a 
filter  output  that  is  slightly  lower  than  the  output  due  to  the  best  matching  European  yield 

Figure 19: Recognition results for the nine model signs, ordered by their complexities as
in Figure 5. The graph plots the average best correlation for scenes with a recognized sign
and the average best correlation for scenes without signs. The comparison shows that the
correlation for a correct match is high enough to identify the correct class uniquely among
the 9 classes, because the false positive correlations are always lower on average.
Underneath the sign models, a table lists the number of correct positive matches and,
for scenes without corresponding signs, the number of correct negative and false positive
matches. The sum of the entries in each column gives the total number of images used for
each model class. False positive matches only occur for low-complexity signs.




Figure 20: Results for an image with a nonuniform shadow. On the left (correct match,
$r = 0.51$), the final “no-entry” replica is shown overlying the no-entry sign. On the right
(false match, $r = 0.56$), the final “yield” replica is shown overlying an arbitrary part of the image.

sign,  as  shown  in  the  right  image  in  Figure  20.  As  can  be  seen  from  the  data  in  Figure  19, 
the  European  no-entry  and  European  yield  models  generally  result  in  high  filter  outputs  for 
arbitrary  scenes  and  are  therefore  responsible  for  the  vast  majority  of  false  matches. 

The  poor  performance  of  the  European  no-entry,  European  yield,  and  priority  signs  is 
due  to  their  low  complexities,  as  defined  in  Eq.  30.  Figure  5  shows  that  their  complexities 
are  orders  of  magnitude  lower  than  the  complexities  of  signs  that  do  not  have  such  simple 
geometric  shapes. 

Traffic  signs  with  inscriptions  and  complicated  shapes  are  generally  more  complex  and 
therefore  more  sensitive  to  positional,  angular,  and  size  variations.  Such  signs  are  easier  to 
unambiguously  recognize  than  signs  with  low  complexities.  This  fact  can  be  used  a  priori  in 
evaluating  the  cross-class  performance  of  a  recognition  system. 

While most of the models used in our experiments are complex enough for robust
recognition, subsequent downsampling, as is necessary in replica contraction, can be detrimental.
Additionally,  the  complexity  of  a  sign  in  a  scene  image  can  be  significantly  diminished  if  the 
sign  is  occluded  by  other  objects  in  the  scene.  We  find  that  the  higher  the  level  of  occlusion 
and  the  smaller  the  complexity,  the  more  difficult  the  recognition  of  the  sign  becomes. 


14  Summary  and  conclusions 


We have developed a general method for object recognition that is information-conserving; it
attains the theoretical lower bound on estimation error for any unbiased estimate regardless
of the method of estimation, and is therefore statistically optimal. Our work quantitatively
shows how information pertinent to an object’s recognizability can be lost by many
approaches to object recognition that are commonly used but are inherently sub-optimal from
a statistical perspective. Moreover, our theoretical foundation provides a framework for
quantitative comparisons between different recognition methods and shows under what
special circumstances sub-optimal techniques, such as purely edge-based methods, can become
optimal.

We have applied our theoretical results to develop a system that has successfully
recognized traffic signs in thousands of complex real-world scenes.

In  future  work,  we  will  extend  our  approach  to  nonplanar  3-D  objects,  using  physical 
models  [10,  24,  16]  that  describe  the  imaging  process  and  the  object’s  reflectance  properties. 
This  extension  of  our  work  will  lead  to  new  and  interesting  descriptors  that  characterize  the 
intrinsic  physical  properties  of  imaged  3-D  objects. 

A Signal-dependent fluctuations of natural light are negligible


This section shows that thermally induced fluctuations of natural light are not a significant
cause of errors in our measurements. Natural light fluctuates as a circular complex Gaussian
random (CCGR) process [9]. The probability density of an $M \times N$ intensity image $W$
measured from a CCGR field is the gamma distribution

$$P(W) = \prod_{k=1}^{MN} \frac{(\mu/\sigma_k)^{\mu}\, W_k^{\mu-1}}{\Gamma(\mu)} \exp\left(-\frac{\mu W_k}{\sigma_k}\right), \tag{53}$$

where the number of coherence cells $\mu$ in the intensity average is defined to be the
time-bandwidth product $\mu = T\Gamma$, where $T$ is the coherence or measurement time, and $\Gamma$ is the
bandwidth of the light. The expected value of $W_k$ is $\sigma_k$, and the variance of $W_k$ is $\sigma_k^2/\mu$.
Given Eq. 53, the probability density for the “gamma-corrected” brightness $I$ is

$$P(I) = \prod_{k=1}^{MN} \frac{(\mu/\sigma_k)^{\mu}\, I_k^{\mu/\gamma-1}}{\gamma\,\Gamma(\mu)} \exp\left(-\frac{\mu I_k^{1/\gamma}}{\sigma_k}\right), \tag{54}$$

since $I_k = W_k^{\gamma}$. For notational convenience, the subscript $k$ is dropped in the following. The
expected value of $I$ is

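$$\mathrm{E}[I] = \int_0^{\infty} W^{\gamma} P(W)\,dW = \left(\frac{\sigma}{\mu}\right)^{\gamma} \frac{\Gamma(\gamma+\mu)}{\Gamma(\mu)},$$

and the variance of $I$ is

$$\mathrm{var}(I) = \mathrm{E}[I^2] - \mathrm{E}[I]^2 = \left(\frac{\sigma}{\mu}\right)^{2\gamma} \frac{\Gamma(\mu)\,\Gamma(2\gamma+\mu) - \big(\Gamma(\gamma+\mu)\big)^2}{\big(\Gamma(\mu)\big)^2}.$$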


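An approximation of the mean of $I$ using Stirling’s formula, $\Gamma(\mu) \approx \mu^{\mu - 1/2} e^{-\mu} (2\pi)^{1/2}$, yields

$$E[I] \approx \left(\frac{\sigma}{\mu}\right)^{1/\gamma} \frac{(1/\gamma + \mu)^{1/\gamma + \mu - 1/2}\, e^{-(1/\gamma + \mu)}}{\mu^{\mu - 1/2}\, e^{-\mu}} = \sigma^{1/\gamma} \left(1 + \frac{1}{\gamma\mu}\right)^{1/\gamma + \mu - 1/2} e^{-1/\gamma} \approx \sigma^{1/\gamma}\, e^{1/\gamma}\, e^{-1/\gamma} = \sigma^{1/\gamma},$$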


which holds for large $\mu$. The variance of $I$ can be approximated by

$$\mathrm{var}(I) \approx \frac{\sigma^{2/\gamma}}{\mu \gamma^2}$$

and is therefore a function of the mean, which reveals that the noise arising from circular
complex Gaussian random fluctuations in the received field is signal-dependent. This is
important for radar and sonar imaging [13], where, due to signal-dependent fluctuation
noise, the variance of high intensity measurements can be larger than the mean of low
intensity measurements. For fluctuations of natural light, however, the intensity average of
the measurements is large enough to reduce the standard deviation to a negligibly small
fraction of its mean, as shown in the following example for green light.
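
As a side check before the example, the large-$\mu$ approximations of the mean and variance can be compared numerically with the exact gamma-function expressions. The sketch below is only illustrative: it assumes the gamma correction has the form $I = W^{1/\gamma}$, and the function names and the values of `sigma` and `gamma` are hypothetical choices, not taken from the measurements. Log-gamma values are used so that the large gamma-function arguments do not overflow.

```python
import math


def gamma_corrected_moments(sigma, mu, gamma):
    """Exact mean and variance of I = W**(1/gamma), where W is gamma
    distributed with mean sigma and variance sigma**2 / mu."""
    a = 1.0 / gamma
    # log E[I] = a*log(sigma/mu) + log Gamma(a + mu) - log Gamma(mu)
    log_mean = a * math.log(sigma / mu) + math.lgamma(a + mu) - math.lgamma(mu)
    mean = math.exp(log_mean)
    # E[I^2] / E[I]^2 = Gamma(2a + mu) * Gamma(mu) / Gamma(a + mu)^2
    log_ratio = (math.lgamma(2.0 * a + mu) + math.lgamma(mu)
                 - 2.0 * math.lgamma(a + mu))
    # expm1 avoids cancellation when the ratio is very close to 1
    variance = mean * mean * math.expm1(log_ratio)
    return mean, variance


def large_mu_approximations(sigma, mu, gamma):
    """Large-mu approximations of the mean and variance of I."""
    mean = sigma ** (1.0 / gamma)
    variance = sigma ** (2.0 / gamma) / (mu * gamma ** 2)
    return mean, variance


sigma, gamma = 100.0, 2.2  # hypothetical values, for illustration only
for mu in (1e2, 1e3, 1e4):
    exact = gamma_corrected_moments(sigma, mu, gamma)
    approx = large_mu_approximations(sigma, mu, gamma)
    print(mu, exact, approx)
```

The agreement improves as `mu` grows, which is the regime relevant for natural light.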

Green light has a bandwidth of $\Gamma_{\mathrm{green}} = 3 \times 10^{8}\,\mathrm{m/s} \times (550\,\mathrm{nm} - 500\,\mathrm{nm})/(550\,\mathrm{nm} \times 500\,\mathrm{nm}) =
5.45 \times 10^{13}$ Hz. With an exposure time of $T = 1/100$ s, the number of coherence cells is
$\mu = 5.45 \times 10^{11}$. The ratio of the standard deviation of $I$ to the mean of $I$ is approximately

$$\frac{\mathrm{std}(I)}{E[I]} \approx \sigma^{-1/\gamma} \sqrt{\frac{\sigma^{2/\gamma}}{\mu\gamma^2}} = \frac{1}{\sqrt{\mu}\,\gamma},$$

which is $O(6 \times 10^{-7})$, a negligibly small ratio compared to that actually measured in Section 2.
Therefore, the inherent signal-dependent fluctuations of natural light have a negligible effect
on our image data.
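
The arithmetic of the green-light example can be reproduced with a few lines of Python. This is a minimal sketch: the value `GAMMA = 2.2` is an assumption (a common camera gamma, with the correction taken as $I = W^{1/\gamma}$) chosen only because it gives a ratio of the stated order; the appendix itself does not list the value used.

```python
import math

SPEED_OF_LIGHT = 3.0e8  # m/s, rounded as in the text
GAMMA = 2.2             # assumed gamma-correction exponent (not given here)


def optical_bandwidth(lambda_long, lambda_short):
    """Bandwidth in Hz of light between two wavelengths given in metres."""
    return SPEED_OF_LIGHT * (lambda_long - lambda_short) / (lambda_long * lambda_short)


bandwidth_green = optical_bandwidth(550e-9, 500e-9)  # Hz
exposure_time = 1.0 / 100.0                          # s
coherence_cells = exposure_time * bandwidth_green    # time-bandwidth product

# Ratio of standard deviation to mean of the gamma-corrected brightness
relative_std = 1.0 / (math.sqrt(coherence_cells) * GAMMA)

print(f"bandwidth       = {bandwidth_green:.3e} Hz")
print(f"coherence cells = {coherence_cells:.3e}")
print(f"std(I) / E[I]   = {relative_std:.2e}")
```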


B  The  lower  bound  on  position  estimation 

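This section analyzes how the lower bound on position estimation, as derived in Section 4.1,
varies with changes in object rotation. Let $R$ be a two-dimensional orthonormal matrix.
The Fisher information can then be expressed in terms of $R$ and a diagonal matrix $D$ that
consists of the principal components $D_{11}$ and $D_{22}$ of the bandwidth matrix $B$,

$$J = \frac{E}{\sigma^2} B = \frac{E}{\sigma^2} R D R^T = \frac{E}{\sigma^2} R \begin{pmatrix} D_{11} & 0 \\ 0 & D_{22} \end{pmatrix} R^T,$$

where

$$D_{11} = \frac{B_x^2 + B_y^2}{2} + \frac{1}{2}\sqrt{4 B_{xy}^2 + (B_x^2 - B_y^2)^2},$$

$$D_{22} = \frac{B_x^2 + B_y^2}{2} - \frac{1}{2}\sqrt{4 B_{xy}^2 + (B_x^2 - B_y^2)^2}, \qquad (57)$$

and $B_{xy}$ denotes the off-diagonal element of $B$.

The angle $\varphi$ that rotates the $x$- and $y$-axes of the image into the principal axes given by the
object’s bandwidth matrix is defined by

$$\tan(2\varphi) = \frac{2 B_{xy}}{B_x^2 - B_y^2}. \qquad (58)$$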


The lower bound on position recognition is therefore

$$J^{-1} = \frac{\sigma^2}{E} R D^{-1} R^T. \qquad (59)$$

The lower bound on recognizing the position coordinate $x_0$ can then be expressed as

$$E[(\hat{x}_0 - x_0)^2] \ge J^{-1}_{11} = \frac{\sigma^2}{E}\, \frac{D_{11} \sin^2\varphi + D_{22} \cos^2\varphi}{D_{11} D_{22}}, \qquad (60)$$

and on recognizing the coordinate $y_0$ as

$$E[(\hat{y}_0 - y_0)^2] \ge J^{-1}_{22} = \frac{\sigma^2}{E}\, \frac{D_{11} \cos^2\varphi + D_{22} \sin^2\varphi}{D_{11} D_{22}}. \qquad (61)$$


If the coordinate axes correspond to the principal components of the bandwidth of the object,
i.e., $B_x^2 = D_{11}$ and $B_y^2 = D_{22}$, then the error in the $x$- and $y$-coordinate of the position is
lower bounded by $\sigma^2/(E B_x^2)$ and $\sigma^2/(E B_y^2)$, respectively. Let the total estimation error be
the Euclidean distance $\varepsilon = \sqrt{E[(\hat{x}_0 - x_0)^2]^2 + E[(\hat{y}_0 - y_0)^2]^2}$. The total error is then lower
bounded by

$$\varepsilon \ge \frac{\sigma^2}{E D_{11} D_{22}} \sqrt{D_{11}^2 + D_{22}^2 + 2 D_{11} D_{22} \sin 2\varphi}. \qquad (62)$$

The bound is smallest if the principal components of the object’s bandwidth matrix are
aligned with the coordinate axes, i.e., $\varphi = 0$ or $\varphi = \pi/2$. The bound is largest for $\varphi = \pi/4$,
for which $E[(\hat{x}_0 - x_0)^2] = E[(\hat{y}_0 - y_0)^2]$.
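
The per-coordinate bounds can be checked numerically by building the bandwidth matrix from its principal components, inverting the Fisher information directly, and comparing the diagonal of the inverse with the closed-form expressions for the $x_0$ and $y_0$ bounds. The sketch below is illustrative only: it assumes $R$ is the counter-clockwise rotation by $\varphi$ and that the bandwidth matrix has diagonal elements $B_x^2$, $B_y^2$ and off-diagonal element $B_{xy}$; the function names and numerical values are hypothetical.

```python
import math


def bandwidth_matrix(d11, d22, phi):
    """Elements (B_x^2, B_xy, B_y^2) of B = R D R^T, with R the rotation by phi."""
    c, s = math.cos(phi), math.sin(phi)
    bxx = c * c * d11 + s * s * d22
    byy = s * s * d11 + c * c * d22
    bxy = c * s * (d11 - d22)
    return bxx, bxy, byy


def position_bounds_numeric(bxx, bxy, byy, energy, noise_var):
    """Diagonal of J^{-1} for J = (E / sigma^2) B, by explicit 2x2 inversion."""
    det = bxx * byy - bxy * bxy
    scale = noise_var / energy
    return scale * byy / det, scale * bxx / det


def position_bounds_closed_form(d11, d22, phi, energy, noise_var):
    """Closed-form bounds on the x0 and y0 estimation errors."""
    s2, c2 = math.sin(phi) ** 2, math.cos(phi) ** 2
    scale = noise_var / (energy * d11 * d22)
    return scale * (d11 * s2 + d22 * c2), scale * (d11 * c2 + d22 * s2)


def principal_components(bxx, bxy, byy):
    """Eigenvalues (D11 >= D22) of the symmetric 2x2 bandwidth matrix."""
    half_sum = 0.5 * (bxx + byy)
    half_root = 0.5 * math.sqrt(4.0 * bxy * bxy + (bxx - byy) ** 2)
    return half_sum + half_root, half_sum - half_root


d11, d22 = 4.0, 1.0            # hypothetical principal components
energy, noise_var = 50.0, 2.0  # hypothetical E and sigma^2
for phi in (0.0, 0.3, math.pi / 4, 1.2):
    b = bandwidth_matrix(d11, d22, phi)
    print(phi,
          position_bounds_numeric(*b, energy, noise_var),
          position_bounds_closed_form(d11, d22, phi, energy, noise_var))
```

For $\varphi = 0$ the two bounds reduce to $\sigma^2/(E D_{11})$ and $\sigma^2/(E D_{22})$, and for $\varphi = \pi/4$ they coincide.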


C  Linear  invariance  of  the  weighted  matched  filter 

Given the definition in Eq. 40, the output of the weighted matched filter

$$r(I_q, c_1 q + c_2) = \frac{A \sum I_q(x,y)\,(c_1 q(x,y) + c_2) - \left(\sum I_q(x,y)\right)\left(\sum (c_1 q(x,y) + c_2)\right)}{\sqrt{A \sum I_q(x,y)^2 - \left(\sum I_q(x,y)\right)^2}\; \sqrt{A \sum (c_1 q(x,y) + c_2)^2 - \left(\sum (c_1 q(x,y) + c_2)\right)^2}}$$

describes how well a linearly transformed replica $c_1 q(x,y) + c_2$ matches with the measured
data in subimage $I_q(x,y)$. The numerator of $r(I_q, c_1 q + c_2)$ is

$$c_1 \left(A \sum I_q(x,y)\, q(x,y) - \sum I_q(x,y) \sum q(x,y)\right),$$

and the second square root in the denominator is

$$\left(A \sum \left(c_1^2 q(x,y)^2 + 2 c_1 c_2 q(x,y) + c_2^2\right) - c_1^2 \left(\sum q(x,y)\right)^2 - 2 c_1 A c_2 \sum q(x,y) - (A c_2)^2\right)^{1/2},$$


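which yields

$$c_1 \sqrt{A \sum q(x,y)^2 - \left(\sum q(x,y)\right)^2}$$

for $c_1 > 0$. Since

$$r(I_q, c_1 q + c_2) = \frac{c_1 \left(A \sum I_q(x,y)\, q(x,y) - \left(\sum I_q(x,y)\right)\left(\sum q(x,y)\right)\right)}{\sqrt{A \sum I_q(x,y)^2 - \left(\sum I_q(x,y)\right)^2}\; c_1 \sqrt{A \sum q(x,y)^2 - \left(\sum q(x,y)\right)^2}} = r(I_q, q),$$

the weighted matched filter is invariant to linear transformations of image brightness and
Eq. 47 holds.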
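
The invariance can also be confirmed numerically. The sketch below assumes that Eq. 40 has the normalized-correlation form implied by the expression at the start of this appendix, with $A$ the number of pixels in the subimage; the function name and the pixel values are hypothetical, and the subimage and replica are given as flattened lists of equal length.

```python
import math


def weighted_matched_filter(subimage, replica):
    """Normalized correlation r(I_q, q) between a subimage and a replica."""
    a = len(subimage)
    sum_i = sum(subimage)
    sum_q = sum(replica)
    sum_iq = sum(i * q for i, q in zip(subimage, replica))
    sum_ii = sum(i * i for i in subimage)
    sum_qq = sum(q * q for q in replica)
    numerator = a * sum_iq - sum_i * sum_q
    denominator = (math.sqrt(a * sum_ii - sum_i ** 2)
                   * math.sqrt(a * sum_qq - sum_q ** 2))
    return numerator / denominator


subimage = [12.0, 40.0, 33.0, 7.0, 25.0, 18.0]  # hypothetical measured data
replica = [10.0, 35.0, 30.0, 5.0, 28.0, 15.0]   # hypothetical replica q

c1, c2 = 3.5, -20.0  # linear brightness transformation with c1 > 0
transformed = [c1 * q + c2 for q in replica]

r_original = weighted_matched_filter(subimage, replica)
r_transformed = weighted_matched_filter(subimage, transformed)
print(r_original, r_transformed)  # the two values agree up to rounding
```

For a negative $c_1$ the square root returns $|c_1|$ rather than $c_1$, so the filter output changes sign; the invariance as stated applies to $c_1 > 0$.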